Graph theoretical analysis reveals the functional role of the left ventral occipito-temporal cortex in speech processing

The left ventral occipito-temporal cortex (left-vOT) plays a key role in reading. Interestingly, the area also responds to speech input, suggesting that it may have other functions beyond written word recognition. Here, we adopt graph theoretical analysis to investigate the left-vOT’s functional role in the whole-brain network while participants process spoken sentences in different contexts. Overall, different connectivity measures indicate that the left-vOT acts as an interface enabling the communication between distributed brain regions and sub-networks. During simple speech perception, the left-vOT is systematically part of the visual network and contributes to the communication between neighboring areas, remote areas, and sub-networks, by acting as a local bridge, a global bridge, and a connector, respectively. However, when speech comprehension is explicitly required, the specific functional role of the area and the sub-network to which the left-vOT belongs change and vary with the quality of speech signal and task difficulty. These connectivity patterns provide insightful information on the contribution of the left-vOT in various contexts of language processing beyond its role in reading. They advance our general understanding of the neural mechanisms underlying the flexibility of the language network that adjusts itself according to the processing context.


Supplementary Information
Supplementary Table S1. Interpretations of graph theory terms.

Graph terminology | Topological scale | Interpretation
Global efficiency | global | Functional integration. Expresses information exchange between nodes by multiple parallel paths across the whole network.
Clustering coefficient | global | Functional segregation. Expresses information exchange by parallel paths between the nearest neighbors of nodes.
Modularity Q | sub-network | How well a network can be subdivided into non-overlapping sub-networks.
Community structure | sub-network | Sub-networks identified by community detection algorithm (partitions).
Flow coefficient | nodal | How well a node can transfer information between its neighbors. Hubs with large flow coefficient are identified as local bridges in the network.
Betweenness centrality | nodal | How often a node joins the shortest path between pairs of nodes in the network. Hubs with large betweenness centrality are identified as global bridges in the network.
Participation coefficient | nodal | How well a node can coordinate the communication between different sub-networks. Expresses the number of connections linked to a given node across sub-networks. Hubs with large participation coefficient are identified as connectors between sub-networks in the network.
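As an illustration of two of the measures in the table, the sketch below computes the participation coefficient and the global efficiency on a toy binary, undirected graph. It assumes the standard definitions (participation coefficient P_i = 1 - sum over sub-networks s of (k_is / k_i)^2; global efficiency as the mean inverse shortest-path length), which may differ in detail from the toolbox used in the study. The graph, the community labels, and the function names are hypothetical.

```python
from collections import deque


def participation_coefficient(adj, community):
    """P_i = 1 - sum_s (k_is / k_i)^2 for a binary undirected graph.

    adj: dict mapping node -> set of neighboring nodes
    community: dict mapping node -> sub-network label
    """
    pc = {}
    for node, neighbors in adj.items():
        k = len(neighbors)
        if k == 0:
            pc[node] = 0.0  # isolated node: no connections to distribute
            continue
        # Count how many of the node's edges fall into each sub-network.
        counts = {}
        for nb in neighbors:
            label = community[nb]
            counts[label] = counts.get(label, 0) + 1
        pc[node] = 1.0 - sum((c / k) ** 2 for c in counts.values())
    return pc


def global_efficiency(adj):
    """Mean inverse shortest-path length over all ordered node pairs."""
    nodes = list(adj)
    n = len(nodes)
    if n < 2:
        return 0.0
    total = 0.0
    for src in nodes:
        # Breadth-first search gives shortest path lengths in a binary graph.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        # Unreachable nodes are absent from dist and contribute zero.
        total += sum(1.0 / d for node, d in dist.items() if node != src)
    return total / (n * (n - 1))


# Hypothetical toy network: a triangle (a, b, c) plus a chain a-d-e.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a", "e"},
    "e": {"d"},
}
community = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2}

pc = participation_coefficient(adj, community)
ge = global_efficiency(adj)
print(pc)
print(ge)
```

In this toy example, nodes "a" and "d" have edges in both sub-networks and therefore obtain a non-zero participation coefficient, which is the property used to identify connectors.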

Supplementary Table S2.
An ROI within the primary visual cortex (V1) and an ROI in the left posterior STG (pSTG), the latter being involved in both spoken and written language processing [1][2][3], were selected from the set of 263 ROIs [4].

To ascertain that the results reported in the main document, where the analyses were conducted on the ROIs from Power et al. [4], were not restricted to a specific set of ROIs or to a single density (15%), we conducted additional analyses: 1) across the range of densities 15%-22% with the FDR correction using Power et al.'s ROIs [4] and 2) using the symmetrical ROI set [5] across the range of densities 14%-18% with the FDR correction. As respectively shown in Supplementary Tables S3 and S4, the new analyses confirmed the patterns of functional connectivity reported in the main document. Moreover, we also applied community detection to networks at densities ranging from 2% to 10% in steps of 1% to confirm the stability of the modularity measure [4]. The result showed no differences in modularity Q across the baseline and four speech processing conditions (Supplementary Table S5; all ps > 0.39), indicating a stable community structure across this range of sparser network densities.
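The FDR correction across densities can be sketched as follows. The text does not name the procedure, so this minimal example assumes the Benjamini-Hochberg step-up method; the p-values are hypothetical and the function name `fdr_bh` is introduced here only for illustration.

```python
def fdr_bh(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values, e.g. one per network density.
raw = [0.01, 0.04, 0.03, 0.20]
adj_p = fdr_bh(raw)
significant = [p < 0.05 for p in adj_p]
print(adj_p)
print(significant)
```

With these hypothetical inputs only the smallest p-value remains below 0.05 after correction, which shows why a result must hold across the whole density range to be reported as significant.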
Network construction of the symmetrical ROI set. A symmetrical set of 402 ROIs was taken from Di et al. [5] to confirm the global and nodal results reported in the main text while controlling for the spatial bias in lateralization. The same procedure as described in the main text (see Network Construction) was performed on this set of ROIs. Specifically, the 402 ROIs with a radius of 5 mm were intersected with the group-averaged gray matter mask to exclude the ROIs that are outside the gray matter, resulting in 392 ROIs. The 10 ROIs that were removed were mainly subcortical nuclei located outside the gray matter mask. Additionally, the left-vOT identified in the visual localizer task and its homologous region in the right hemisphere were included in the symmetrical ROI set. Finally, three ROIs in the left fusiform gyrus and three ROIs in the right fusiform gyrus were further removed due to their overlap with the functionally localized left-vOT and its homologous region, respectively.
Altogether, this resulted in 388 ROIs in the symmetrical set (see Supplementary Fig. S3 for the spatial distribution of this set of ROIs). The beta-series connectivity and Fisher-z transformation were then applied to the symmetrical ROI set to construct a brain network for each condition and each participant. The density range here was identified as 14% to 18% in steps of 1%. The lower bound, 14% density, is the sparsest density at which 90% of all the networks are fully connected, without any significant between-condition difference in the size of the largest connected component (LCC). The upper bound, 18% density, is the minimal density at which all the networks are fully connected.
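The density bounds described above depend on thresholding each connectivity matrix to a target density and checking the size of its largest connected component. The sketch below illustrates this on a hypothetical 4-node matrix. It assumes proportional thresholding (keeping the strongest edges) and binary networks; the function names are illustrative and not taken from the study.

```python
from collections import deque


def threshold_by_density(weights, density):
    """Keep the strongest edges so that `density` of all possible edges remain.

    weights: symmetric square matrix (list of lists) of connectivity values
    Returns a binary undirected graph as dict node -> set of neighbors.
    """
    n = len(weights)
    edges = [(weights[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
    edges.sort(reverse=True)  # strongest connections first
    n_keep = round(density * len(edges))
    adj = {i: set() for i in range(n)}
    for _, i, j in edges[:n_keep]:
        adj[i].add(j)
        adj[j].add(i)
    return adj


def largest_component_size(adj):
    """Number of nodes in the largest connected component (LCC)."""
    seen = set()
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best


# Hypothetical Fisher-z connectivity values for four ROIs.
weights = [
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.7, 0.2],
    [0.8, 0.7, 0.0, 0.3],
    [0.1, 0.2, 0.3, 0.0],
]

for density in (0.5, 0.67):
    graph = threshold_by_density(weights, density)
    lcc = largest_component_size(graph)
    print(density, lcc, lcc == len(weights))
```

A network counts as fully connected when its LCC contains every node; repeating this check over participants, conditions, and densities would give the proportion of fully connected networks at each density.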
Global network measures. As a demonstration, the results obtained using the symmetrical ROI set at the sparsest density (14%) are shown here to validate the results reported in the main document. At the global scale, the post-hoc tests confirmed that global efficiency in the PN- (p < 0.040), CN- (p < 0.00066) and CN+ (p < 0.0036) conditions was significantly higher than in the baseline, while the difference between the PN+ condition and the baseline was marginally significant (p < 0.053). In terms of clustering coefficient, the post-hoc tests confirmed that the CN- (p < 0.0018) and CN+ (p < 0.0026) conditions were significantly lower than the baseline, while the differences between the baseline and the perception conditions were not significant (all ps > 0.083). Overall, the results confirmed that speech processing led to higher global efficiency and a lower clustering coefficient compared to the baseline.

Nodal measures of the left-vOT.
At the nodal scale, hubs are identified as the top 5% of nodes among the 388 nodes (i.e., above the 19th out of 388). Supplementary Fig. S4A showed [...] (Supplementary Fig. S4B). However, in terms of flow coefficient, the left-vOT was not identified as a hub since its ranks were below 35th in all conditions. It is worth noting that this set of ROIs is data-driven and strictly symmetrical [5], which is not the case for the set of ROIs from Power et al. [4]. Bilaterally homologous regions tend to be strongly connected [6,7]; therefore, this set of ROIs can induce numerous one-step connections between mirror regions and result in reorganized local bridges in the network. As shown in Supplementary Fig. S4C, the correlation analysis showed that participation coefficient and reaction time were negatively correlated in the CN+ condition (Pearson's r = -0.51, p < 0.011), while the correlations were not significant in the other conditions (all ps > 0.32).

Supplementary Table S3. Between-conditions significance in global and nodal metrics of brain networks based on the 263 ROIs [4]. The results were corrected with FDR across the range of densities (15%-22%). FDR-corrected ps < 0.05 are highlighted in bold. Comparisons conducted on global metrics include the baseline and four speech processing conditions, while comparisons conducted on nodal metrics include the four speech processing conditions.

[...] processing conditions in modularity Q of brain networks based on the 263 ROIs [4], in the range of density 2%-10%.

Supplementary Figure S3. The spatial distribution of the set of 388 ROIs originally taken from Di et al. [5]. Notes: Each ROI is a sphere with a radius of 5 mm (i.e., 81 voxels) and is intersected with the group-averaged gray matter mask to exclude areas that are outside the gray matter. The group left-vOT with 81 voxels (colored in red) is centered at MNI x = -47, y = -55, z = -17.
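The hub criterion used for the nodal measures (top 5% of the 388 nodes) can be illustrated with a short ranking sketch. It assumes that the cutoff is the integer part of 5% of the node count and that ranks 1 to 19 count as hubs, which is one reading of the phrasing in the text; the nodal values and the function name are hypothetical.

```python
def hub_ranks(values, top_fraction=0.05):
    """Rank nodes by a nodal metric (rank 1 = largest) and flag the top fraction.

    values: dict mapping node -> metric value (e.g., betweenness centrality)
    Returns (ranks, hubs, cutoff).
    """
    n = len(values)
    cutoff = int(top_fraction * n)  # 5% of 388 nodes gives 19
    order = sorted(values, key=values.get, reverse=True)
    ranks = {node: r for r, node in enumerate(order, start=1)}
    hubs = {node for node, r in ranks.items() if r <= cutoff}
    return ranks, hubs, cutoff


# Hypothetical metric values for 388 nodes; node 387 has the largest value.
values = {node: float(node) for node in range(388)}
ranks, hubs, cutoff = hub_ranks(values)
print(cutoff, len(hubs), ranks[387])
```

Under this reading, a node ranked 35th or lower on a given metric, as reported for the flow coefficient of the left-vOT, falls outside the hub set.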

Supplementary Results 2
The community detection analysis identified four sub-networks for both baseline and speech processing conditions. As mentioned in the Results section of the main text, the community structures were similar across conditions. We labeled these four sub-networks as visual network, fronto-parietal network, default mode network, and sensorimotor-auditory network based on the original community structure and labels defined by Power et al. [4]. As illustrated in Supplementary Fig. S5, the cyan and dark blue sub-networks closely matched the "visual network" and "default mode network" reported in Power et al. [4], respectively. The green sub-network largely overlapped with the fronto-parietal task control, ventral and dorsal attention, and salience networks in Power et al. [4] and was labeled "fronto-parietal network" in our study. Finally, the magenta sub-network contained two somatosensory-motor networks (mouth and hand) and the auditory network reported in Power et al. [4]. Although it also overlapped with several subcortical networks (i.e., cingulo-opercular task control, subcortical, cerebellar), we labeled this sub-network "sensorimotor-auditory network".
Supplementary Figure S5. Community structures (partitions) and labeling. The top five rows show the community structures identified for baseline and speech processing conditions, which mainly contain four sub-networks that are colored cyan (visual network), green (fronto-parietal network), dark blue (default mode network) and magenta (sensorimotor-auditory network). The bottom row shows the community structure defined by Power et al. [4] through meta-analysis, where white bars represent ROIs that cannot be clearly assigned to a sub-network. The legend shows the labels originally defined by Power et al. [4].

Supplementary Results 3
Group left-vOT defined by the visual localizer task. The pre-processed functional data obtained in the block-design visual localizer task were fitted with a general linear model (GLM) using AFNI. The same set of nuisance regressors as in the spoken sentence processing tasks was included, namely 1) 24 head motion regressors: the six motion parameters, their temporal derivatives, and the squares of all these time series; and 2) 26 physiological regressors: the mean time series and the first twelve principal components of white matter and of CSF, which were extracted using the aCompCor method, as well as the cosine-basis regressors estimated by fMRIPrep. Motion-contaminated volumes were identified using framewise displacement (FD): volumes with FD > 0.5 mm were censored along with the preceding volume. On average, 1.8% of the volumes were censored for the visual localizer task. The group-level analysis was then performed to compute the contrast words - consonant strings using a paired t-test in AFNI (3dttest++). To combine the group left-vOT with the set of 263 spherical ROIs [4] described in the main text, a group left-vOT with the same volume of 81 voxels as a spherical ROI was extracted. To this end, the group activation map of the contrast words - consonant strings was thresholded to extract the strongest cluster with a volume of 81 voxels by adjusting the significance threshold (T > 5.2536, p < 2.5e-5). The spatial distribution of the final set of ROIs is shown in Supplementary Fig. S6.
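The motion-related steps described above can be sketched in a few lines. The example expands six motion parameters into 24 regressors and builds a censoring mask from framewise displacement. It assumes a backward difference for the temporal derivative (set to zero at the first volume), which the text does not specify; the input values and function names are hypothetical, and the actual analysis relied on fMRIPrep and AFNI.

```python
def expand_motion(params):
    """Expand T x 6 motion parameters into T x 24 regressors.

    Columns: 6 parameters, 6 temporal derivatives, then the squares of all 12.
    The derivative is a backward difference (an assumption), zero at t = 0.
    """
    expanded = []
    for t, row in enumerate(params):
        prev = params[t - 1] if t > 0 else row
        deriv = [a - b for a, b in zip(row, prev)]
        base = list(row) + deriv
        expanded.append(base + [x * x for x in base])
    return expanded


def censor_mask(fd, threshold=0.5):
    """Return True for volumes to keep, False for censored volumes.

    A volume is censored when its FD exceeds the threshold (in mm);
    the volume immediately before it is censored as well.
    """
    keep = [True] * len(fd)
    for t, value in enumerate(fd):
        if value > threshold:
            keep[t] = False
            if t > 0:
                keep[t - 1] = False
    return keep


# Hypothetical motion parameters for three volumes (3 translations, 3 rotations).
params = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, -0.1, 0.0, 0.01, 0.0],
    [0.3, 0.1, -0.1, 0.0, 0.01, 0.02],
]
regressors = expand_motion(params)

# Hypothetical framewise displacement values in mm.
fd = [0.0, 0.1, 0.7, 0.2, 0.3, 0.6, 0.9, 0.1]
keep = censor_mask(fd)
censored_fraction = keep.count(False) / len(keep)
print(len(regressors[0]), keep, censored_fraction)
```

Averaging the censored fraction over participants would give a summary comparable to the 1.8% of volumes reported for the visual localizer task.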
Supplementary Figure S6. The spatial distribution of the set of 263 ROIs originally taken from Power et al. [4]. Notes: Each ROI is a sphere with a radius of 5 mm (i.e., 81 voxels) and is intersected with the group-averaged gray matter mask to exclude areas that are outside the gray matter. The group left-vOT with 81 voxels (colored in red) is centered at MNI x = -47, y = -55, z = -17.